 differential expression


VCWorld: A Biological World Model for Virtual Cell Simulation

Wei, Zhijian, Ma, Runze, Wang, Zichen, Li, Zhongmin, Song, Shuotong, Zheng, Shuangjia

arXiv.org Artificial Intelligence

Virtual cell modeling aims to predict cellular responses to perturbations. Existing virtual cell models rely heavily on large-scale single-cell datasets, learning explicit mappings between gene expression and perturbations. Although recent models attempt to incorporate multi-source biological information, their generalization remains constrained by data quality, coverage, and batch effects. More critically, these models often function as black boxes, offering predictions without interpretability or consistency with biological principles, which undermines their credibility in scientific research. To address these challenges, we present VCWorld, a cell-level white-box simulator that integrates structured biological knowledge with the iterative reasoning capabilities of large language models to instantiate a biological world model. VCWorld operates in a data-efficient manner to reproduce perturbation-induced signaling cascades and generates interpretable, stepwise predictions alongside explicit mechanistic hypotheses. In drug perturbation benchmarks, VCWorld achieves state-of-the-art predictive performance, and the inferred mechanistic pathways are consistent with publicly available biological evidence.



Rare Genomic Subtype Discovery from RNA-seq via Autoencoder Embeddings and Stability-Aware Clustering

Mezghiche, Alaa

arXiv.org Artificial Intelligence

Unsupervised learning on high-dimensional RNA-seq data can reveal molecular subtypes beyond standard labels. We combine an autoencoder-based representation with clustering and stability analysis to search for rare but reproducible genomic subtypes. On the UCI "Gene Expression Cancer RNA-Seq" dataset (801 samples, 20,531 genes; BRCA, COAD, KIRC, LUAD, PRAD), a pan-cancer analysis shows clusters aligning almost perfectly with tissue of origin (Cramer's V = 0.887), serving as a negative control. We therefore reframe the problem within KIRC (n = 146): we select the top 2,000 highly variable genes, standardize them, train a feed-forward autoencoder (128-dimensional latent space), and run k-means for k = 2-10. While global indices favor small k, scanning k with a pre-specified discovery rule (rare < 10 percent and stable with Jaccard >= 0.60 across 20 seeds after Hungarian alignment) yields a simple solution at k = 5 (silhouette = 0.129, DBI = 2.045) with a rare cluster C0 (6.85 percent of patients) that is highly stable (Jaccard = 0.787). Cluster-vs-rest differential expression (Welch's t-test, Benjamini-Hochberg FDR) identifies coherent markers. Overall, pan-cancer clustering is dominated by tissue of origin, whereas a stability-aware within-cancer approach reveals a rare, reproducible KIRC subtype.
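
The discovery rule in this abstract (a cluster is kept if it is rare, under 10 percent of samples, and stable, with Jaccard of at least 0.60 across reruns) can be sketched in a few lines. This is a minimal illustration and not the author's code: the function names are hypothetical, and each rerun is matched to the reference cluster by taking the best-overlapping cluster, a simpler stand-in for the Hungarian alignment the abstract names.

```python
from statistics import mean


def jaccard(a, b):
    """Jaccard index of two collections of sample indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def clusters_from_labels(labels):
    """Group sample indices by their cluster label."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, set()).add(idx)
    return clusters


def rare_stable_clusters(reference, reruns, rare_frac=0.10, min_jaccard=0.60):
    """Return {label: mean Jaccard} for reference clusters that are rare and stable.

    Assumption: best-match Jaccard per rerun replaces Hungarian alignment.
    """
    n = len(reference)
    found = {}
    for lab, members in clusters_from_labels(reference).items():
        if len(members) / n >= rare_frac:
            continue  # not rare
        scores = [
            max(jaccard(members, other)
                for other in clusters_from_labels(run).values())
            for run in reruns
        ]
        stability = mean(scores)
        if stability >= min_jaccard:
            found[lab] = stability
    return found


# Toy labels for 30 samples: cluster 0 holds 2 samples (about 6.7 percent).
reference = [0, 0] + [1] * 14 + [2] * 14
run_a = [2, 2] + [0] * 14 + [1] * 14      # same partition, labels permuted
run_b = [1, 1, 1] + [0] * 13 + [2] * 14   # rare cluster absorbs one extra sample
result = rare_stable_clusters(reference, [run_a, run_b])
```

In the toy data the rare cluster is recovered exactly in one rerun and with one extra member in the other, so it passes both thresholds; the labels themselves differ between runs, which is why some form of matching is needed before comparing clusters.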


Contextualizing biological perturbation experiments through language

Wu, Menghua, Littman, Russell, Levine, Jacob, Qiu, Lin, Biancalani, Tommaso, Richmond, David, Huetter, Jan-Christian

arXiv.org Artificial Intelligence

High-content perturbation experiments allow scientists to probe biomolecular systems at unprecedented resolution, but experimental and analysis costs pose significant barriers to widespread adoption. Machine learning has the potential to guide efficient exploration of the perturbation space and extract novel insights from these data. However, current approaches neglect the semantic richness of the relevant biology, and their objectives are misaligned with downstream biological analyses. In this paper, we hypothesize that large language models (LLMs) present a natural medium for representing complex biological relationships and rationalizing experimental outcomes. We propose PerturbQA, a benchmark for structured reasoning over perturbation experiments. Unlike current benchmarks that primarily interrogate existing knowledge, PerturbQA is inspired by open problems in perturbation modeling: prediction of differential expression and change of direction for unseen perturbations, and gene set enrichment. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR), a simple, domain-informed LLM framework that matches or exceeds the current state-of-the-art. Our code and data are publicly available at https://github.com/genentech/PerturbQA.


Does your model understand genes? A benchmark of gene properties for biological and text models

Kan-Tor, Yoav, Danziger, Michael Morris, Zohar, Eden, Ninio, Matan, Shimoni, Yishai

arXiv.org Artificial Intelligence

The application of deep learning methods, particularly foundation models, in biological research has surged in recent years. These models can be text-based or trained on underlying biological data, especially omics data of various types. However, comparing the performance of these models consistently has proven to be a challenge due to differences in training data and downstream tasks. To tackle this problem, we developed an architecture-agnostic benchmarking approach that, instead of evaluating the models directly, leverages entity representation vectors from each model and trains simple predictive models for each benchmarking task. This ensures that all types of models are evaluated using the same input and output types. Here we focus on gene properties collected from professionally curated bioinformatics databases. These gene properties are categorized into five major groups: genomic properties, regulatory functions, localization, biological processes, and protein properties. Overall, we define hundreds of tasks based on these databases, which include binary, multi-label, and multi-class classification tasks. We apply these benchmark tasks to evaluate expression-based models, large language models, protein language models, DNA-based models, and traditional baselines. Our findings suggest that text-based models and protein language models generally outperform expression-based models in genomic properties and regulatory functions tasks, whereas expression-based models demonstrate superior performance in localization tasks. These results should aid in the development of more informed artificial intelligence strategies for biological understanding and therapeutic discovery. To ensure the reproducibility and transparency of our findings, we have made the source code and benchmark data publicly accessible for further investigation and expansion at github.com/BiomedSciAI/gene-benchmark.


SeqMate: A Novel Large Language Model Pipeline for Automating RNA Sequencing

Mondal, Devam, Inamdar, Atharva

arXiv.org Artificial Intelligence

RNA sequencing techniques, like bulk RNA-seq and single-cell (sc) RNA-seq, are critical tools for the biologist looking to analyze the genetic activity/transcriptome of a tissue or cell during an experimental procedure. Platforms like Illumina's next-generation sequencing (NGS) are used to produce the raw data for this experimental procedure. This raw FASTQ data must then be prepared via a complex series of data manipulations by bioinformaticians. This process currently takes place on an unwieldy textual user interface like a terminal/command line that requires the user to install and import multiple program packages, preventing the untrained biologist from initiating data analysis. Open-source platforms like Galaxy have produced a more user-friendly pipeline, yet the visual interface remains cluttered and highly technical, and therefore uninviting for the natural scientist. To address this, we present SeqMate, a user-friendly tool that allows for one-click analytics by utilizing the power of a large language model (LLM) to automate both data preparation and analysis (differential expression, trajectory analysis, etc.). Furthermore, by utilizing the power of generative AI, SeqMate is also capable of analyzing such findings and producing written reports of upregulated/downregulated/user-prompted genes with sources cited from known repositories like PubMed, PDB, and UniProt.


Modelling Cellular Perturbations with the Sparse Additive Mechanism Shift Variational Autoencoder

Bereket, Michael, Karaletsos, Theofanis

arXiv.org Machine Learning

Generative models of observations under interventions have been a vibrant topic of interest across machine learning and the sciences in recent years. For example, in drug discovery, there is a need to model the effects of diverse interventions on cells in order to characterize unknown biological mechanisms of action. We propose the Sparse Additive Mechanism Shift Variational Autoencoder, SAMS-VAE, to combine compositionality, disentanglement, and interpretability for perturbation models. SAMS-VAE models the latent state of a perturbed sample as the sum of a local latent variable capturing sample-specific variation and sparse global variables of latent intervention effects. Crucially, SAMS-VAE sparsifies these global latent variables for individual perturbations to identify disentangled, perturbation-specific latent subspaces that are flexibly composable. We evaluate SAMS-VAE both quantitatively and qualitatively on a range of tasks using two popular single cell sequencing datasets. In order to measure perturbation-specific model-properties, we also introduce a framework for evaluation of perturbation models based on average treatment effects with links to posterior predictive checks. SAMS-VAE outperforms comparable models in terms of generalization across in-distribution and out-of-distribution tasks, including a combinatorial reasoning task under resource paucity, and yields interpretable latent structures which correlate strongly to known biological mechanisms. Our results suggest SAMS-VAE is an interesting addition to the modeling toolkit for machine learning-driven scientific discovery.
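
The additive structure described in this abstract (a sample-specific latent plus sparse, perturbation-specific global effects) can be illustrated with plain lists. This is only a sketch of the composition step under stated assumptions: the function name, the binary masks, and the dose indicators are illustrative choices, and none of the model's inference or training is shown.

```python
def compose_latent(basal, doses, masks, embeddings):
    """Add sparse perturbation effects to a sample-specific (basal) latent.

    basal:      list of floats, the local latent for one sample
    doses:      one number per perturbation (0 means not applied)
    masks:      one 0/1 list per perturbation selecting the latent dimensions it may shift
    embeddings: one list of floats per perturbation, its global latent effect
    All names here are hypothetical and chosen for illustration.
    """
    latent = list(basal)
    for dose, mask, emb in zip(doses, masks, embeddings):
        for j, (m, e) in enumerate(zip(mask, emb)):
            latent[j] += dose * m * e  # masked dimensions contribute nothing
    return latent


basal = [0.5, -1.0, 0.0, 2.0]
masks = [[1, 0, 0, 0], [0, 0, 1, 0]]          # each perturbation touches one dimension
embeddings = [[2.0, 9.0, 9.0, 9.0], [9.0, 9.0, -1.5, 9.0]]

only_a = compose_latent(basal, [1, 0], masks, embeddings)
a_and_b = compose_latent(basal, [1, 1], masks, embeddings)
```

Because each mask zeroes out most dimensions, the two perturbations shift disjoint coordinates here, and applying both is simply the sum of their individual shifts; that is the sense in which sparse additive effects are composable.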


MACHINE LEARNING FOR BIOMARKER DISCOVERY USING NGS DATA - CBIRT

#artificialintelligence

In this article, we will be exploring how machine learning (ML) techniques can be used to identify and analyze biomarkers from next generation sequencing (NGS) data. Biomarkers are specific biological molecules or characteristics that can be used to identify the presence or severity of a particular disease or condition. They play a crucial role in medical diagnosis, treatment, and prognosis, and their discovery and validation is an important area of research in the field of biomedical science. Next Generation Sequencing is a powerful tool that allows researchers to analyze large amounts of genetic data quickly and accurately. By combining the capabilities of machine learning with next generation sequencing data, we can unlock the potential to identify and validate new biomarkers that can improve our understanding of diseases and lead to more effective treatments.


Spatiotemporal transcriptomic divergence across human and macaque brain development

Science

Improved understanding of how the developing human nervous system differs from that of closely related nonhuman primates is fundamental for teasing out human-specific aspects of behavior, cognition, and disorders. The shared and unique functional properties of the human nervous system are rooted in the complex transcriptional programs governing the development of distinct cell types, neural circuits, and regions. However, the precise molecular mechanisms underlying shared and unique features of the developing human nervous system have been only minimally characterized. We generated complementary tissue-level and single-cell transcriptomic datasets from up to 16 brain regions covering prenatal and postnatal development in humans and rhesus macaques (Macaca mulatta), a closely related species and the most commonly studied nonhuman primate. We created and applied TranscriptomeAge and TempShift algorithms to age-match developing specimens between the species and to more rigorously ...


Points of Significance: Statistics versus machine learning

#artificialintelligence

To compare traditional statistics to ML approaches, we'll use a simulation of the expression of 40 genes in two phenotypes. Mean gene expression will differ between phenotypes, but we'll set up the simulation so that the mean difference for the first 30 genes is not related to phenotype. The last ten genes will be dysregulated, with systematic differences in mean expression between phenotypes. To achieve this, we assign each gene an average log expression that is the same for both phenotypes. The dysregulated genes (31–40, labeled A–J) have their mean expression perturbed in one of the two phenotypes (Figure 1a).
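
A setup of this kind can be sketched with the Python standard library. The gene counts follow the text; the sample size per phenotype, the effect size, the noise level, and all function names are assumptions made for illustration and are not taken from the article.

```python
import random
from statistics import mean

N_GENES = 40          # from the text
N_DYSREGULATED = 10   # genes 31-40, labeled A-J in the text


def simulate(n_per_group=10, effect=2.0, noise_sd=1.0, seed=0):
    """Simulate log expression for two phenotypes (all parameter values are assumed)."""
    rng = random.Random(seed)
    # Each gene gets one baseline mean shared by both phenotypes.
    baseline = [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]
    # Only the last ten genes receive a systematic shift in the second phenotype.
    shift = [0.0] * (N_GENES - N_DYSREGULATED) + [effect] * N_DYSREGULATED
    control = [[rng.gauss(mu, noise_sd) for mu in baseline]
               for _ in range(n_per_group)]
    perturbed = [[rng.gauss(mu + s, noise_sd) for mu, s in zip(baseline, shift)]
                 for _ in range(n_per_group)]
    return control, perturbed


def mean_differences(control, perturbed):
    """Per-gene difference in mean log expression (perturbed minus control)."""
    return [
        mean(sample[g] for sample in perturbed) - mean(sample[g] for sample in control)
        for g in range(N_GENES)
    ]


control, perturbed = simulate()
diffs = mean_differences(control, perturbed)
```

With nonzero noise, the first 30 differences scatter around zero by chance alone, which is the situation the text describes: means differ between phenotypes for every gene, but only the last ten differences reflect a real effect.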